STA6235: Modeling in Regression
We have previously discussed continuous outcomes and the appropriate distributions.
Normal distribution
Gamma distribution
Let’s now consider categorical outcomes:
Binary
Ordinal
Multinomial
For a binary outcome, the binary logistic regression model is \ln \left( \frac{\pi}{1-\pi} \right) = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k,
where \pi = \text{P}[Y = 1] = the probability of the outcome/event.
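As a concrete example: if \pi = 0.75, the odds are 0.75/0.25 = 3 and the log odds are \ln(3) \approx 1.10. A quick check in R:

```r
p <- 0.75
odds <- p / (1 - p)  # odds of the event: 3
log(odds)            # log odds: about 1.0986
```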
How is this different from linear regression? y = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k
or Gamma regression?
\ln(y) = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k
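All three models fit into the same glm() framework; only the family (and link function) changes. A minimal sketch with simulated data — dat, y_cont, y_pos, and y_bin are hypothetical names, not from the course data:

```r
set.seed(1)
dat <- data.frame(x = rnorm(200))
dat$y_cont <- 2 + 0.5 * dat$x + rnorm(200)        # continuous outcome -> normal
dat$y_pos  <- rgamma(200, shape = 2, rate = 1)    # positive outcome -> Gamma
dat$y_bin  <- rbinom(200, 1, plogis(-1 + dat$x))  # binary (0/1) outcome -> binomial

m_normal <- glm(y_cont ~ x, data = dat, family = gaussian)                 # identity link
m_gamma  <- glm(y_pos ~ x, data = dat, family = Gamma(link = "log"))       # log link
m_logit  <- glm(y_bin ~ x, data = dat, family = binomial(link = "logit"))  # logit link
```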
library(tidyverse)
richmondway <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-09-26/richmondway.csv') %>%
mutate(dating = if_else(Dating_flag == "Yes", 1, 0),
IMDB = if_else(Imdb_rating >= 8.5, 1, 0)) %>%
select(Season, Episode, F_count_RK, F_perc, dating, IMDB)
# quantile(richmondway$Imdb_rating, c(0, 0.25, 0.5, 0.75, 1))
# richmondway %>% count(IMDB)
m1 <- glm(dating ~ F_perc + IMDB,
          data = richmondway,
          family = binomial(link = "logit"))
summary(m1)
Call:
glm(formula = dating ~ F_perc + IMDB, family = binomial(link = "logit"),
data = richmondway)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.76166 1.08995 -1.616 0.106
F_perc 0.03323 0.02506 1.326 0.185
IMDB 0.37986 0.72250 0.526 0.599
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 46.662 on 33 degrees of freedom
Residual deviance: 44.261 on 31 degrees of freedom
AIC: 50.261
Number of Fisher Scoring iterations: 4
The fitted model is \ln \left( \frac{\hat{\pi}}{1-\hat{\pi}} \right) = -1.76 + 0.03 x_1 + 0.38 x_2, where
x_1 is the percentage of the episode’s F-bombs that came from Roy Kent
x_2 is the episode’s IMDB rating category (1 if the rating is at least 8.5, 0 otherwise)
Recall the binary logistic regression model, \ln \left( \frac{\pi}{1-\pi} \right) = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k,
We are modeling the log odds, which are not intuitive to interpret.
To be able to discuss the odds, we will “undo” the natural log by exponentiation.
i.e., if we want to interpret the slope for x_i, we will look at e^{\hat{\beta}_i}.
When interpreting \hat{\beta}_i, it is an additive effect on the log odds.
When interpreting e^{\hat{\beta}_i}, it is a multiplicative effect on the odds.
\begin{align*} \ln \left( \frac{\pi}{1-\pi} \right) &= \beta_0 + \beta_1 x_1 + ... + \beta_k x_k \\ \exp\left\{ \ln \left( \frac{\pi}{1-\pi} \right) \right\} &= \exp\left\{ \beta_0 + \beta_1 x_1 + ... + \beta_k x_k \right\} \\ \frac{\pi}{1-\pi} &= e^{\beta_0} e^{\beta_1 x_1} \cdots e^{\beta_k x_k} \end{align*}
For continuous predictors: e^{\hat{\beta}_i} is the multiplicative change in the odds for a one-unit increase in x_i, holding all other predictors constant.
For categorical predictors: e^{\hat{\beta}_i} is the odds ratio comparing the indicated group to the reference group, holding all other predictors constant.
Let’s interpret the odds ratios:
For a 1 percentage point increase in the percentage of f-bombs that came from Roy Kent, the odds of Roy and Keeley dating increase by 3%.
Compared to episodes with an IMDB rating below 8.5, the odds of Roy and Keeley dating are 46% higher in episodes with a rating of at least 8.5.
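These percentages come from exponentiating the fitted slopes (equivalently, exp(coef(m1)) in R). A quick check:

```r
exp(0.03323)  # about 1.034 -> odds multiply by roughly 1.03 per percentage point
exp(0.37986)  # about 1.462 -> odds roughly 46% higher when the rating is >= 8.5
```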
summary():
Call:
glm(formula = dating ~ F_perc + IMDB, family = binomial(link = "logit"),
data = richmondway)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.76166 1.08995 -1.616 0.106
F_perc 0.03323 0.02506 1.326 0.185
IMDB 0.37986 0.72250 0.526 0.599
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 46.662 on 33 degrees of freedom
Residual deviance: 44.261 on 31 degrees of freedom
AIC: 50.261
Number of Fisher Scoring iterations: 4
Using the full/reduced Partial F approach:
Note 1! I am showing this for demonstration purposes; we do not need a Partial F for this particular model.
Note 2! We must add test = "LRT" to the anova() function.
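As a sketch (for demonstration only, per Note 1): fit a reduced model that drops IMDB and compare it to the full model m1 from earlier with a likelihood ratio test.

```r
# Reduced model: drop IMDB from the full model m1 fit earlier
reduced <- glm(dating ~ F_perc, data = richmondway,
               family = binomial(link = "logit"))

# Likelihood ratio test comparing reduced vs. full
anova(reduced, m1, test = "LRT")
```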
What we’ve learned so far about assessing the significance of predictors still holds with logistic regression.
The guidelines we’ve set up for data visualization still hold true.
We will put our outcome on the y-axis and a continuous (or at least ordinal) predictor on the x-axis.
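One way to sketch this with ggplot2, using the continuous predictor F_perc on the x-axis: jitter the 0/1 outcome vertically so the points don't overplot, and overlay the fitted logistic curve with geom_smooth() (the axis labels are my additions):

```r
library(ggplot2)

p <- ggplot(richmondway, aes(x = F_perc, y = dating)) +
  geom_jitter(width = 0, height = 0.05) +  # nudge the 0/1 points apart vertically
  geom_smooth(method = "glm", method.args = list(family = "binomial"),
              se = FALSE) +                # fitted logistic curve
  labs(x = "% of episode's F-bombs from Roy Kent",
       y = "Dating (0/1) and fitted P[dating = 1]")
p
```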
Recall the logistic regression model, \ln \left( \frac{\pi_i}{1-\pi_i} \right) = \beta_0 + \beta_1 x_{1i} + ... + \beta_k x_{ki}
We can solve for the probability, which allows us to predict the probability that y_i=1 given the specified model: \pi_i = \frac{\exp\left\{ \beta_0 + \beta_1 x_{1i} + ... + \beta_k x_{ki} \right\}}{1 + \exp\left\{ \beta_0 + \beta_1 x_{1i} + ... + \beta_k x_{ki} \right\}}
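In R, predict() with type = "response" returns \hat{\pi}_i directly. We can also compute it by hand; here for a hypothetical episode with F_perc = 50 and IMDB = 1 (values chosen purely for illustration):

```r
# Predicted probability for every episode in the data
pi_hat <- predict(m1, type = "response")

# By hand for F_perc = 50, IMDB = 1:
eta <- -1.76166 + 0.03323 * 50 + 0.37986 * 1  # linear predictor (log odds)
exp(eta) / (1 + exp(eta))                     # about 0.57; same as plogis(eta)
```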